Shampoo Sales¶

This dataset describes the monthly number of sales of shampoo over a 3 year period.
The unit is a sales count, and there are 36 observations.
Dataset source: Makridakis, Wheelwright and Hyndman (1998).

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly import offline
from plotly.subplots import make_subplots
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
from matplotlib import pyplot as plt
from pandas.plotting import autocorrelation_plot
In [2]:
offline.init_notebook_mode()

First, let's explore the dataset!

Reading the dataset¶

In [3]:
df = pd.read_csv('../datasets/shampoo.csv')

Think of a time series as a list of numbers with some information about the time when those numbers were recorded. In this dataset we have monthly observations of shampoo sales over three years:

In [4]:
df.head()
Out[4]:
Month Sales
0 1-01 266.0
1 1-02 145.9
2 1-03 183.1
3 1-04 119.3
4 1-05 180.3

In this particular dataset, we have 36 observations.

In [5]:
df.shape
Out[5]:
(36, 2)

It's also recommended to check the data types of each column to ensure we can perform the necessary operations.

In [6]:
df.dtypes
Out[6]:
Month     object
Sales    float64
dtype: object

We should also check for NaN values. For this dataset, there are none.

In [7]:
df.isna().sum()
Out[7]:
Month    0
Sales    0
dtype: int64

Since this is a very small and simple dataset, not much data cleaning is necessary. We can, however, do some exploration, preprocessing and transformations.

Pre-processing and Exploration¶

Since we are working with time series, it's easier if we represent Month as a datetime type. To do that, we pad the year with a leading zero so we can convert it with pandas' to_datetime function. For this particular dataset we do not care about the actual year of the observations, but do note that to_datetime will interpret zero-padded two-digit years like these (no century) as 20xx years.
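A quick check of that two-digit-year behaviour (on hypothetical dates, not this dataset): Python's %y directive maps values 00-68 to 20xx and 69-99 to 19xx.

```python
import pandas as pd

# Two-digit years in %y are interpreted relative to a pivot:
# 00-68 become 20xx, while 69-99 become 19xx.
print(pd.to_datetime('01-05', format='%y-%m').year)  # 2001
print(pd.to_datetime('99-05', format='%y-%m').year)  # 1999
```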

In [8]:
df['Month'] = df['Month'].apply(lambda x: x.zfill(5))

Now the dates are zero-padded.

In [9]:
df.head(2)
Out[9]:
Month Sales
0 01-01 266.0
1 01-02 145.9
In [10]:
df['Month'] = pd.to_datetime(df['Month'], format='%y-%m')

With the conversion they will look like this:

In [11]:
df.head(2)
Out[11]:
Month Sales
0 2001-01-01 266.0
1 2001-02-01 145.9

Since we are working with pandas datetimes, we can check the frequency of the time series, which for this dataset we expect to be monthly. Pandas has a handy way to do that using the infer_freq function:

In [12]:
pd.infer_freq(df['Month'])
Out[12]:
'MS'

MS stands for "month start frequency". You can check the full list of offset aliases in the pandas documentation.
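As a quick sanity check on a hypothetical date range (not the shampoo data), dates generated with the 'MS' alias fall on the first day of each month, and infer_freq recovers the alias:

```python
import pandas as pd

# Hypothetical example: a date range built with the 'MS' (month start)
# alias lands on the first day of each month, and infer_freq recovers it.
idx = pd.date_range('2001-01-01', periods=6, freq='MS')
print(idx.day.tolist())    # [1, 1, 1, 1, 1, 1]
print(pd.infer_freq(idx))  # MS
```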

Plotly lets us see the time plot of sales along the months. To do that, let's create a function that makes line plots, so our code is reusable.

In [13]:
def plot_line(df, x, y, title, template='simple_white'):
  fig = px.line(data_frame=df, x=x, y=y,
                title=title,
                template=template
                )
  fig.show()
In [14]:
plot_line(df, 'Month', 'Sales', 'Sales of Shampoo by month')

Looking at this plot, we can see some interesting features of the data, such as:

  • Sales in the first month of 2001 were stronger than in the following months, only reaching a new peak in November.
  • No month had zero sales.
  • Every month of 2003 had better sales than the same month in 2002.
  • September of 2003 is when sales reached their all-time peak.
  • Sales tend to decrease after a strong month.


So we can tentatively assume that there is a clear trend of increasing sales year over year, with some peaks and valleys along the way, but it's not possible to say with certainty that the sales of every month are correlated with time or that there is a strong seasonal pattern in the data. Luckily, we have some tools to understand how this is happening and at what pace. A good approach is to decompose the time series into its components and to look at the sales difference between consecutive months to see how much sales are actually increasing or decreasing month by month.

Trend, seasonality and noise¶

One of the first things we can check are the components of a time series: trend and seasonality.

Trend lets us see if there is a long-term increase or decrease in the data. Seasonality, on the other hand, is a pattern that occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week.
Besides those, time series are also composed of noise, the random variation in the series which cannot be explained.

We can try to analyse the seasonality in the data by using information we already know to ask some questions. For instance, since our data points are months of the year, we can ask what impact the month of the year has on sales. To do that, let's create a column that represents the month of the year.

In [15]:
df['month_of_the_year'] = df['Month'].dt.month_name()
df.head()
Out[15]:
Month Sales month_of_the_year
0 2001-01-01 266.0 January
1 2001-02-01 145.9 February
2 2001-03-01 183.1 March
3 2001-04-01 119.3 April
4 2001-05-01 180.3 May

Can we see any monthly pattern if we plot the sales as three different lines? To do that, let's create a function named go_line_plot that creates a single Scatter plot, so we can reuse it later if we need simpler plots.

In [16]:
def go_line_plot(x, y, name):
    return go.Scatter(x=x, y=y,
                    mode='lines',
                    name=name)

Then we divide our dataset into three slices and add to the plot:

In [17]:
fig = go.Figure()
fig.add_trace(go_line_plot(df['month_of_the_year'].iloc[:12], df['Sales'].iloc[:12], 'year 1'))
fig.add_trace(go_line_plot(df['month_of_the_year'].iloc[12:24], df['Sales'].iloc[12:24], 'year 2'))
fig.add_trace(go_line_plot(df['month_of_the_year'].iloc[24:], df['Sales'].iloc[24:], 'year 3'))

fig.update_layout(title_text='Shampoo Sales by month',
                    template='simple_white')

fig.show()

Looking at this plot, it's still not very clear whether there is any seasonality, especially because each month's sales also contain other elements such as trend and noise. An easy way to see the elements of a series is to decompose it. We can do a naive decomposition in Python using the statsmodels seasonal_decompose function. To do that, we first need to set the Month column as the index. We will also drop the 'month_of_the_year' column.

In [18]:
df.set_index('Month',inplace=True)
df.drop('month_of_the_year', axis=1, inplace=True)

Considering that the time series is an aggregate of trend, seasonality and noise, we can think of those components as additive (the series is the result of adding the components) or multiplicative (the series is the result of multiplying the components). In the real world, data is not that simple and sometimes there are both additive and multiplicative components in the series, but seasonal_decompose requires that you specify whether the model is additive or multiplicative. For now, let's analyse both.

In [19]:
decomposed = seasonal_decompose(df['Sales'], model='additive')

Let's plot all the components together, so it's easier to analyse. To do that, we create a function that stacks multiple plots using Plotly.

In [20]:
def stacked_subplots(rows, dict_values, df, title):
    fig = make_subplots(rows=rows, cols=1)
    i=1
    
    for key, value in dict_values.items():
        fig.add_trace(
            go_line_plot(df[value[0]], df[value[1]], name=key),
            row=i, col=1
        )
        i += 1

    fig.update_layout(title_text=title,
                    template='simple_white',
                    autosize=False,
                    width=1040,
                    height=600)

    fig.show()

Let's now create a dataframe with all the components together.

In [21]:
def generate_time_series_components_df(decomposed):
    components = pd.DataFrame()
    components['index'] = decomposed.observed.index
    components['seasonal'] = decomposed.seasonal.values
    components['trend'] = decomposed.trend.values
    components['residual'] = decomposed.resid.values
    components['observed'] = decomposed.observed.values
    return components

First the additive model.

In [22]:
df_dec = generate_time_series_components_df(decomposed)

plots = {'Seasonality': ('index', 'seasonal'),
          'Trend': ('index', 'trend'),
          'Residual': ('index', 'residual'),
          'Observed': ('index', 'observed')
}
In [23]:
stacked_subplots(rows=4, dict_values=plots, df=df_dec, title='Additive Decomposed Time Series of Sales')

Now the multiplicative model (we can reuse the dictionary with the desired plots).

In [24]:
decomposed = seasonal_decompose(df['Sales'], model='multiplicative')

df_dec = generate_time_series_components_df(decomposed)
stacked_subplots(rows=4, dict_values=plots, df=df_dec, title='Multiplicative Decomposed Time Series of Sales')

Comparing the additive and multiplicative models, the plots are very similar except for the scale. Note that this function performs a naive decomposition, and more sophisticated methods are preferred for more complex datasets. Also, since our time series is quite short, there are not enough cycles to fully estimate the seasonal pattern, so half a cycle (6 months) at each end of the trend is lost because we did not specify special handling of the endpoints.

Analysing both plots, we can now conclude that there is in fact an increasing trend in sales. The seasonality plot, on the other hand, shows features of the data such as a big increase in sales in November and a big decrease in May. The residual plot is the part of this series we cannot explain.

Differencing¶

Differencing is a type of time series transformation used to stabilize the mean of the time series by removing changes in its level, thereby eliminating (or reducing) trend and seasonality (Hyndman, 2018). This is a way to make a non-stationary time series stationary. Pandas already has a diff function we can use to create another column in our dataset.

In [25]:
df.reset_index(inplace=True)
In [26]:
df['diff'] = df['Sales'].diff()
In [27]:
plot_line(df, 'Month', 'diff', 'Sales difference by month')

If we want to remove temporal structure, we can perform the diff operation again. The idea is that we repeat this process until all temporal dependencies are removed. The number of times the differencing operation is performed is called the difference order.
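A tiny hypothetical example of why repeated differencing removes trend: for a purely linear series, the first differences are constant and the second differences are zero.

```python
import pandas as pd

# Hypothetical linear series: differencing once removes the trend
# (constant differences); differencing twice leaves zeros.
s = pd.Series([10, 12, 14, 16, 18], dtype=float)
print(s.diff().tolist())         # [nan, 2.0, 2.0, 2.0, 2.0]
print(s.diff().diff().tolist())  # [nan, nan, 0.0, 0.0, 0.0]
```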

In [28]:
df['second_order_diff'] = df['Sales'].diff().diff()
plot_line(df, 'Month', 'second_order_diff', 'Second-order sales difference by month')

We then create a dictionary with our (x, y) tuples and plot labels.

In [29]:
values = {'sales': ('Month', 'Sales'),
          'lag-1': ('Month', 'diff'),
          'lag-2': ('Month', 'second_order_diff')
}
In [30]:
stacked_subplots(rows=3, dict_values=values, df=df, title='Shampoo Sale Analysis')
In [31]:
df.drop(['diff', 'second_order_diff'], axis=1, inplace=True)

Autocorrelation¶

Most of the time, we analyse time series data to make some sort of forecast. We could use our Shampoo Sales data to predict sales for the next months. But the main question is: can we actually make predictions for the next month if we know the current and past months' sales? We can make such an analysis using autocorrelation. Autocorrelation is a representation of the degree of similarity between a time series and a lagged version of itself over successive time intervals. High autocorrelation suggests a strong relationship between a variable's present value and its past values.

Before getting there, let's try to plot the correlation between the sales and the months. To do that, let's create a column which represents the month index, going from 1 to 36.

In [32]:
df['month_i'] = np.arange(1, 37, 1)

Then let's find the correlation between the month and the sales.

In [33]:
np.corrcoef(df['month_i'], df['Sales'])
Out[33]:
array([[1.        , 0.85446057],
       [0.85446057, 1.        ]])

The correlation is 0.85 (very strong); we can also see that in the following plot:

In [34]:
fig = px.scatter(data_frame=df, x='month_i', y='Sales',
                title='Correlation between month and Sales',
                template='simple_white'
                )
fig.show()
In [35]:
df.drop('month_i', axis=1, inplace=True)

Note that the correlation coefficient only measures the strength of the linear relationship and can sometimes be misleading. What we can do now is the same thing, but this time calculating the correlation between the time series and itself.
To calculate the correlation, we use the original dataset and create shifted copies of it, then calculate the correlation between them. Pandas has a function to do just that.
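To see how the linear-only caveat can bite, consider a hypothetical example: y depends perfectly on x, but because the relationship is quadratic, the Pearson correlation is essentially zero.

```python
import numpy as np

# A perfect (but nonlinear) relationship with ~zero Pearson correlation:
# y = x^2 over a range symmetric around zero.
x = np.arange(-5, 6, dtype=float)
y = x ** 2
print(abs(np.corrcoef(x, y)[0, 1]) < 1e-12)  # True
```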

In [36]:
df['Sales_lag_1'] = df['Sales'].shift(1)
df['Sales_lag_2'] = df['Sales'].shift(2)
df.head()
Out[36]:
Month Sales Sales_lag_1 Sales_lag_2
0 2001-01-01 266.0 NaN NaN
1 2001-02-01 145.9 266.0 NaN
2 2001-03-01 183.1 145.9 266.0
3 2001-04-01 119.3 183.1 145.9
4 2001-05-01 180.3 119.3 183.1

We then calculate the correlation between the original time series and every lag.

In [37]:
def calculate_autocorr(data, shift):
    autocorrelation = []
    
    for s in range(1, shift):
        
        correlation = np.corrcoef(data[:-s], data[s:])[0,1]
        autocorrelation.append(correlation)
        
    return autocorrelation
In [38]:
calculate_autocorr(df['Sales'].values, 10)
Out[38]:
[0.7194822398024306,
 0.8507433352850972,
 0.7549486594101286,
 0.7988976289305707,
 0.7975335474572063,
 0.7201133465915676,
 0.7449453468917949,
 0.6634723950283556,
 0.687762132398903]
In [39]:
x = np.arange(1, 20, 1)
y = calculate_autocorr(df['Sales'].values, 20)

fig = go.Figure(data=go.Scatter(x=x, y=y, mode='markers'))
fig.update_layout(title_text='Autocorrelation plot',
                    template='simple_white')
fig.show()

What we can see is that with each successive shift, the correlation tends to decrease. We can assume that the sales of a month are reasonably similar to the sales of the next month, but the further out you go, the harder it is to predict the sales. We can also plot the autocorrelation of the time series with Pandas using the autocorrelation_plot function. Note that autocorrelation ranges from -1 to 1, where 1 means strong positive autocorrelation, -1 strong negative autocorrelation and 0 no autocorrelation.
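As a cross-check on synthetic data (not the shampoo series), the manual shift-and-correlate approach used above agrees with pandas' built-in Series.autocorr:

```python
import numpy as np
import pandas as pd

# A random walk is strongly autocorrelated, which makes it a good
# synthetic test case for comparing the two approaches.
rng = np.random.default_rng(0)
s = pd.Series(np.cumsum(rng.normal(size=100)))

for lag in (1, 2, 3):
    manual = np.corrcoef(s.iloc[:-lag], s.iloc[lag:])[0, 1]
    print(np.isclose(manual, s.autocorr(lag)))  # True
```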

In [40]:
fig = plt.figure(figsize=(12, 6))
fig.suptitle('Autocorrelation plot', fontsize=18, x=0.25, y=0.95)
autocorrelation_plot(df['Sales'])
Out[40]:
<AxesSubplot:xlabel='Lag', ylabel='Autocorrelation'>

We can also use the Durbin-Watson statistic to test for serial correlation. The Durbin-Watson statistic lies in the 0 to 4 range, with a value near 2 indicating no first-order serial correlation. Positive serial correlation is associated with DW values below 2 and negative serial correlation with DW values above 2.

In [41]:
sm.stats.durbin_watson(df['Sales'].values)
Out[41]:
0.09548538187456361

The value of 0.0954, far below 2, is strong evidence that our series has high positive autocorrelation.